
[AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget #495

Open
duburcqa wants to merge 1 commit into duburcqa/split_llvm_adstack_runtime_overflow from
duburcqa/llvm_adstack_safety

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Apr 17, 2026

Guard against LLVM worker-thread stack overflow from large per-task adstack budget

CPU-only codegen guard. Rejects compilation when the cumulative AdStackAllocaStmt::size_in_bytes() in a single LLVM task crosses the ~256 KB secondary-thread stack budget; without it, the frame silently clobbers adjacent stack memory and the reverse pass returns zero / garbage gradients.

TL;DR

void TaskCodeGenLLVM::visit(AdStackAllocaStmt *stmt) {
  QD_ASSERT_INFO(stmt->max_size > 0, "...");
  auto type = llvm::ArrayType::get(llvm::Type::getInt8Ty(*llvm_context), stmt->size_in_bytes());

  if (arch_is_cpu(current_arch())) {
    constexpr std::size_t kFnScopeAdStackBudgetBytes = 256 * 1024;
    ad_stack_fn_scope_bytes_ += stmt->size_in_bytes();
    QD_ERROR_IF(ad_stack_fn_scope_bytes_ > kFnScopeAdStackBudgetBytes,
                "LLVM autodiff-stack budget exceeded: cumulative `AdStackAllocaStmt` size {} bytes in task "
                "'{}' crosses the {} byte function-scope budget. ...",
                ad_stack_fn_scope_bytes_, kernel_name, kFnScopeAdStackBudgetBytes);
  }

  auto alloca = create_entry_block_alloca(type, sizeof(int64));
  // ...
}

The check is CPU-only by design: on CUDA / AMDGPU the same LLVM allocas lower to per-thread GPU local memory (a separate address space sized by the driver, not shared with the CPU call stack), so the 256 KB CPU-stack budget is not meaningful there. A non-gated version of the check would falsely reject valid GPU kernels with f64 loop-carried variables (4 adstacks at ad_stack_size=4096 already cross 256 KB).

Why 256 KB

macOS secondary threads default to a ~512 KB stack. The worker-thread pool used by the LLVM JIT runs on those. A function-scope alloca for an adstack sits on the LLVM stack frame; if the sum of its sizes across a task crosses the thread's stack limit, the frame corrupts adjacent stack memory (typically the next page, sometimes guard pages). Downstream reverse-mode accumulators read zero, producing silently-wrong gradients with no crash.

256 KB is a conservative upper bound that leaves ~256 KB for other locals and nested call frames. Linux defaults to ~8 MB per secondary thread, so the limit is strictly conservative there — the codegen is more protective on Linux than strictly necessary, which is fine.
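The budget arithmetic above can be checked with a small stand-in. This is an illustrative sketch, not the real `AdStackAllocaStmt::size_in_bytes()`; it assumes the layout described in this PR (an 8-byte int64 header plus `ad_stack_size` slots, each holding a primal/adjoint pair of the element type).

```python
HEADER_BYTES = 8           # int64 header, per the alloca sizing in the snippet above
BUDGET_BYTES = 256 * 1024  # kFnScopeAdStackBudgetBytes

def adstack_size_in_bytes(ad_stack_size: int, elem_bytes: int) -> int:
    """Approximate per-adstack footprint under the assumed layout."""
    return HEADER_BYTES + ad_stack_size * 2 * elem_bytes

def exceeds_budget(num_stacks: int, ad_stack_size: int, elem_bytes: int = 8) -> bool:
    """Would the cumulative task footprint cross the function-scope budget?"""
    return num_stacks * adstack_size_in_bytes(ad_stack_size, elem_bytes) > BUDGET_BYTES

# One f64 adstack at ad_stack_size=4096: 8 + 4096 * 16 = 65,544 bytes.
print(adstack_size_in_bytes(4096, 8))  # 65544
print(exceeds_budget(5, 4096))  # True  — the test kernel's shape (~320 KB)
print(exceeds_budget(4, 4096))  # True  — matches the "4 adstacks already cross" claim
print(exceeds_budget(4, 32))    # False — default-sized stacks stay far under budget
```

This reproduces both figures cited in the description: five f64 stacks at 4096 slots total ~320 KB, and even four such stacks (262,176 bytes) edge past the 262,144-byte budget.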

Why QD_ERROR_IF and not throw

QD_ERROR_IF logs a descriptive message and then calls QD_UNREACHABLE (compiler-assisted abort). The codegen runs inside the LLVM compilation worker thread pool, where a C++ exception thrown from a worker doesn't cleanly propagate back to the Python-level caller (pybind11's exception translation only works at pybind binding boundaries, which are on the main thread). Throwing QuadrantsRuntimeError from here results in std::terminate() and a bare SIGABRT — a worse user experience than the guard's current behaviour, which at least logs the message before aborting.

The trade-off is that the abort shows up as a SIGABRT in the child process rather than a catchable Python exception. The testing approach below handles that.
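The propagation problem has a direct Python analogue: an exception raised on a worker thread never reaches a `try`/`except` on the thread that spawned it. This sketch demonstrates the general behaviour, not the quadrants runtime itself.

```python
import threading

def worker():
    # Raised on the worker thread, analogous to throwing from the LLVM
    # compilation pool: the caller's except clause never sees it.
    raise RuntimeError("budget exceeded")

caught = False
try:
    t = threading.Thread(target=worker)
    t.start()
    t.join()   # join() returns normally; the exception died with the thread
except RuntimeError:
    caught = True

print(caught)  # False — the worker's exception never propagated to the caller
```

Python at least prints the worker's traceback to stderr; in the C++ case an uncaught exception on the worker instead escalates to `std::terminate()`, which is why logging-then-aborting via `QD_ERROR_IF` is the lesser evil here.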

Obsoleted by subsequent PRs

Autodiff 12 (LLVM heap-backed adstack) moves the storage off the worker thread stack entirely, making this guard unnecessary. When that PR lands, the guard — along with the ad_stack_fn_scope_bytes_ accumulator field — is removed. This PR is still useful in the interim to make the silent-corruption failure loud, and to provide coverage for any future path that might re-introduce function-scope allocas for adstack storage.

Changes

quadrants/codegen/llvm/codegen_llvm.{h,cpp}

  • TaskCodeGenLLVM gains an ad_stack_fn_scope_bytes_ accumulator field, reset to 0 at the top of each offloaded task.
  • init_offloaded_task_function resets the accumulator.
  • visit(AdStackAllocaStmt) tallies per-alloca size and raises via QD_ERROR_IF when the per-task total crosses 256 KB on CPU.

Tests

test_adstack_codegen_budget_guard_runs_in_child_process

Runs the overflowing kernel in a child process (since QD_ERROR_IF aborts the process rather than raising a catchable Python exception) and asserts:

  1. The child exits with a non-zero returncode (the guard fired and terminated the process).
  2. The guard's message ("autodiff-stack budget exceeded") appears in the child's combined stdout/stderr.
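The child-process pattern behind these two assertions can be sketched generically. The child code below is a placeholder that mimics the guard's observable behaviour (a diagnostic line followed by an abnormal exit), not the real kernel or test.

```python
import subprocess
import sys

# Stand-in for the overflowing kernel: emit the guard's message, then exit
# abnormally (134 approximates a SIGABRT exit status; this is an assumption).
child_code = """
import sys
sys.stderr.write("autodiff-stack budget exceeded\\n")
sys.exit(134)
"""

proc = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True, text=True,
)

# Assertion 1: the guard fired and terminated the child.
assert proc.returncode != 0
# Assertion 2: the diagnostic appears in the combined output.
assert "autodiff-stack budget exceeded" in proc.stdout + proc.stderr
print("guard surfaced in child process")
```

Running the abort in a subprocess is what lets pytest observe a process-killing `QD_UNREACHABLE` without the test runner itself dying.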

Skip-gated on qd.cpu having the adstack and f64 extensions. Kernel shape: five qd.f64 loop-carried variables applying qd.sin inside a dynamic range(n_iter[None]) at ad_stack_size=4096. Each adstack is 8 + 4096 * 16 = 65,544 bytes ≈ 64 KB; five × 64 KB = 320 KB, comfortably past the 256 KB guard.

The "dynamic n_iter from a field" shape is load-bearing: with a Python-literal range(3) the compile-time trip count is known and the determine-ad-stack-size pass sizes each stack to 3 slots, not default_ad_stack_size=4096. Only a runtime-bound range defeats that pass and leaves each stack at the full default_ad_stack_size.
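The sizing behaviour that makes the dynamic-range shape load-bearing can be modeled in a few lines. This is a hypothetical stand-in for the determine-ad-stack-size pass, not its implementation.

```python
from typing import Optional

DEFAULT_AD_STACK_SIZE = 4096  # default_ad_stack_size referenced above

def sized_ad_stack(compile_time_trip_count: Optional[int]) -> int:
    """Slots allocated per adstack, under the behaviour described in the PR."""
    if compile_time_trip_count is not None:
        # Python-literal range(3): the pass sizes the stack to the trip count.
        return compile_time_trip_count
    # Runtime-bound range(n_iter[None]): the pass cannot shrink the stack.
    return DEFAULT_AD_STACK_SIZE

print(sized_ad_stack(3))     # 3    — far under budget; guard never fires
print(sized_ad_stack(None))  # 4096 — the full default the test relies on
```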

Side-effect audit

Concern Verdict
Existing CPU kernels with small adstacks Unaffected — 256 KB budget is generous for default-sized adstacks (32 slots × entry bytes).
CUDA / AMDGPU Guard is gated on arch_is_cpu(current_arch()); never fires on GPU.
Linux vs macOS 256 KB is conservative on both; Linux stacks are larger by default, so the guard is strictly more protective than necessary there.
Valid-but-large GPU kernels Not rejected (CPU-only gate).
Silent corruption path Surfaced with a descriptive abort message instead of wrong gradients.
Future removal Explicitly obsoleted by the LLVM heap-backed PR (Autodiff 12), which removes the guard and its accumulator.

Stack

Autodiff 9 of 13. Third commit of the "LLVM adstack safety" triplet split and the top-most of that split (this PR is #495). Based on #535 (runtime overflow). Followed by #490 (SPIR-V adstack).


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a5d3009268


Comment thread quadrants/runtime/program_impls/llvm/llvm_program.h Outdated
Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp Outdated
Comment thread tests/python/test_adstack.py Outdated
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from e281a6e to 196977b Compare April 17, 2026 11:37
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from a5d3009 to 79a56b0 Compare April 17, 2026 11:37
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 196977b to c73cb3d Compare April 17, 2026 11:44
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch 2 times, most recently from 91f44c5 to 6abc4aa Compare April 17, 2026 11:53
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from c73cb3d to 3d0ecaf Compare April 17, 2026 12:12
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 6abc4aa to bedaa69 Compare April 17, 2026 12:12
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 3d0ecaf to 7b52cc8 Compare April 17, 2026 12:18
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from bedaa69 to 6ba6b0a Compare April 17, 2026 12:18
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 7b52cc8 to 2a30384 Compare April 17, 2026 12:29
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 6ba6b0a to 7e7b15d Compare April 17, 2026 12:29
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 2a30384 to b390eb0 Compare April 17, 2026 12:31
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 7e7b15d to d68e626 Compare April 17, 2026 12:31
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from b390eb0 to 4984a2b Compare April 17, 2026 12:42
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from d68e626 to bd14f25 Compare April 17, 2026 12:43
@hughperkins hughperkins changed the title [AutoDiff] Surface LLVM adstack push/pop overflow as a Python exception [AutoDiff] Autodiff 7: Surface LLVM adstack push/pop overflow as a Python exception Apr 17, 2026
@hughperkins
Collaborator

Opus review of test factorization:

Looking at the four tests side by side:

Test Loop type Body Expected grad
test_adstack_sum_fixed_coeff dynamic range(3) y += v 3.0
test_adstack_sum_constant_coeffs dynamic range(3) y += v * (a+1) 6.0
test_adstack_sum_static_loop_correct static qd.static(range(3)) y += v 3.0
test_adstack_basic_gradient dynamic range(10) v = v * 0.95 + 0.01 0.95^10

Tests 2, 3, and 4 share identical setup/teardown/assertion structure and only differ in two dimensions: dynamic vs static loop, and fixed vs varying coefficient. These three could collapse into a single parametrized test using if qd.static(...) to switch at compile time:

@pytest.mark.parametrize("use_static_loop", [True, False])
@pytest.mark.parametrize("use_varying_coeff", [True, False])
@test_utils.test(require=qd.extension.adstack, ad_stack_experimental_enabled=True)
def test_adstack_sum_linear(use_static_loop, use_varying_coeff):
    x = qd.field(qd.f32)
    y = qd.field(qd.f32)
    qd.root.dense(qd.i, 1).place(x, x.grad)
    qd.root.place(y, y.grad)

    @qd.kernel
    def compute():
        for i in x:
            v = x[i]
            if qd.static(use_static_loop):
                r = qd.static(range(3))
            else:
                r = range(3)
            for a in r:
                if qd.static(use_varying_coeff):
                    y[None] += v * qd.cast(a + 1, qd.f32)
                else:
                    y[None] += v

    x[0] = 1.0
    y[None] = 0.0
    compute()
    y.grad[None] = 1.0
    x.grad[0] = 0.0
    compute.grad()

    expected = sum((a + 1) for a in range(3)) if use_varying_coeff else 3.0
    assert x.grad[0] == test_utils.approx(expected, rel=1e-4)

This gives you 4 combinations (including static + varying coeff, which is new free coverage) in one test, and eliminates ~50 lines of duplication.

test_adstack_basic_gradient should stay separate — it's the only one with loop-carried state mutation (v = v * 0.95 + 0.01), which is a structurally different pattern. Though it could benefit from a brief comment explaining that distinction.

@hughperkins
Collaborator

Please could you address Opus's suggestions on test factorization.

// The stack index `n` is clamped on read so that overflow (push past capacity) does not let subsequent pops and
// top-accesses underflow it and index far out of bounds. The corresponding stack_push sets
// `runtime->adstack_overflow_flag` and skips the increment instead of trapping, so the host-side launcher
// surfaces the failure as a Python exception rather than killing the process via __builtin_trap. When n == 0
Collaborator


nice

@hughperkins
Collaborator

Asking Opus about any runtime impact:

Hot-path branches in stack operations. stack_push, stack_pop, and stack_top_primal are called on every iteration of every AD loop that uses the adstack. This PR adds a new
branch to each:

• stack_push: bounds check if (n + 1 > max_num_elements) before every push, plus an extra LLVMRuntime *runtime pointer argument threaded through every call site.
• stack_pop: underflow guard if (n > 0) before every pop.
• stack_top_primal: conditional n > 0 ? n - 1 : 0 instead of a bare n - 1.

In normal (non-overflowing) operation the branches are always not-taken, so CPU branch prediction should handle them well. On GPU, all threads in a warp would take the same
path (not-taken), so no divergence cost. But they do add instruction count and register pressure to what is essentially the innermost loop of the backward pass.

synchronize() poll. Every qd.sync() now calls check_adstack_overflow(), which makes a JIT call into the runtime to read and reset the flag, plus a result buffer fetch. This is
once per sync (not per iteration), so it's unlikely to be measurable unless the user syncs at very high frequency.

In practice, the cost is probably negligible — a couple of predictable branches per stack operation. But if someone wanted to benchmark it, the stack push/pop overhead in a
tight AD loop with many iterations would be the place to look.
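The three guarded operations described above can be modeled in a compact Python sketch. The real code lives in `runtime.cpp`; this is an illustrative model of the guard logic only, and the class/field names are assumptions.

```python
class AdStack:
    """Illustrative model of the guarded adstack operations."""

    def __init__(self, max_num_elements: int):
        self.max_num_elements = max_num_elements
        self.n = 0                  # element count (the u64 header in the runtime)
        self.overflow_flag = False  # stands in for runtime->adstack_overflow_flag
        self.data = [None] * max_num_elements

    def push(self, value):
        # Bounds check added by the PR: record overflow and skip, don't trap.
        if self.n + 1 > self.max_num_elements:
            self.overflow_flag = True
            return
        self.data[self.n] = value
        self.n += 1

    def pop(self):
        # Underflow guard: a pop past empty is a no-op.
        if self.n > 0:
            self.n -= 1

    def top_primal(self):
        # Clamped read: n > 0 ? n - 1 : 0, never indexing out of bounds.
        return self.data[self.n - 1 if self.n > 0 else 0]

s = AdStack(2)
for v in (1, 2, 3):
    s.push(v)
print(s.overflow_flag)  # True — the third push overflowed and was skipped
s.pop(); s.pop(); s.pop()
print(s.n)              # 0 — the extra pop was clamped instead of underflowing
```

In the hot path each guard is a single predictable branch, which is the basis for the "probably negligible" cost assessment above.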

@hughperkins
Collaborator

Opus response for test coverage:

Code changes in this PR:

Change Exercised by test?
stack_push overflow check + early return Yes — the 64-iteration kernel triggers it
stack_push sets runtime->adstack_overflow_flag = 1 Yes — the test expects the resulting exception
stack_push new LLVMRuntime *runtime parameter Yes — indirectly, the kernel wouldn't compile otherwise
stack_pop underflow guard (if (n > 0)) No — only reachable after overflow, when the reversed loop pops more times than were successfully pushed. The test raises before observing this path.
stack_top_primal clamping (n > 0 ? n - 1 : 0) No — same situation, only reachable post-overflow.
LLVMRuntime::adstack_overflow_flag field Yes
runtime_retrieve_and_reset_adstack_overflow Yes — called by check_adstack_overflow
LlvmRuntimeExecutor::check_adstack_overflow() Yes
check_adstack_overflow early return when llvm_runtime_ == nullptr No — only reachable during early teardown before materialize_runtime
result_buffer_cache_ caching in materialize_runtime Yes — indirectly, needed for check_adstack_overflow to work
LlvmProgramImpl::synchronize() calls check_adstack_overflow Yes
LlvmProgramImpl::finalize() sets finalizing_ = true No — no test verifies that overflow during teardown doesn't crash
codegen_llvm.cpp passes get_runtime() to stack_push Yes — indirectly
internal_functions.h test_stack updated No — no test calls test_internal_func_args / test_stack

Untested functionality implied by the design:

  1. Flag reset after catch. The flag is reset in runtime_retrieve_and_reset_adstack_overflow, but no test verifies that after catching the overflow exception, a subsequent qd.sync() does not raise again. A second sync after the pytest.raises block would cover this.

  2. Increasing ad_stack_size resolves the overflow. The error message tells users to pass ad_stack_size=N to qd.init(). No test verifies that doing so actually makes the same kernel succeed.

  3. Overflow on SPIR-V. The test accepts both AssertionError and RuntimeError, and the docstring describes the SPIR-V path, but the SPIR-V code path isn't in this diff. If SPIR-V doesn't have equivalent overflow detection, the test would silently pass on LLVM-only CI without ever validating the SPIR-V claim.

  4. Multi-threaded overflow. The comment in stack_push explains the race is benign (all threads write the same sentinel). No test uses a multi-element field where multiple threads overflow simultaneously.

  5. Gradients are actually wrong without the safety check. The test only checks that an exception is raised, not that the gradients would have been silently wrong without it. A companion test at a just-under-capacity iteration count showing correct gradients would strengthen the argument.

  6. Teardown safety. The finalizing_ flag exists specifically so that an overflow during ~Program() → finalize() → synchronize() doesn't throw into a destructor and terminate the process. No test covers this path — e.g., triggering overflow and then letting the qd.init() scope exit without an explicit sync.

@hughperkins
Collaborator

hughperkins commented Apr 17, 2026

Please could you add tests for the untested functionality implied by the design:

  1. Flag reset after catch. The flag is reset in runtime_retrieve_and_reset_adstack_overflow, but no test verifies that after catching the overflow exception, a subsequent qd.sync() does not raise again. A second sync after the pytest.raises block would cover this.

  2. Increasing ad_stack_size resolves the overflow. The error message tells users to pass ad_stack_size=N to qd.init(). No test verifies that doing so actually makes the same kernel succeed.

  3. Overflow on SPIR-V. The test accepts both AssertionError and RuntimeError, and the docstring describes the SPIR-V path, but the SPIR-V code path isn't in this diff. If SPIR-V doesn't have equivalent overflow detection, the test would silently pass on LLVM-only CI without ever validating the SPIR-V claim.

  4. Multi-threaded overflow. The comment in stack_push explains the race is benign (all threads write the same sentinel). No test uses a multi-element field where multiple threads overflow simultaneously.

  5. Gradients are actually wrong without the safety check. The test only checks that an exception is raised, not that the gradients would have been silently wrong without it. A companion test at a just-under-capacity iteration count showing correct gradients would strengthen the argument.

  6. Teardown safety. The finalizing_ flag exists specifically so that an overflow during ~Program() → finalize() → synchronize() doesn't throw into a destructor and terminate the process. No test covers this path — e.g., triggering overflow and then letting the qd.init() scope exit without an explicit sync.

@hughperkins
Collaborator

Oh, also please add tests for the rows in the table above with 'no', ie:

  • stack_pop underflow guard (if (n > 0))
  • stack_top_primal clamping (n > 0 ? n - 1 : 0)
  • check_adstack_overflow early return when llvm_runtime_ == nullptr
  • LlvmProgramImpl::finalize() sets finalizing_ = true
  • internal_functions.h test_stack updated

@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 4984a2b to 1cb9ad7 Compare April 17, 2026 15:57
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from bd14f25 to d1f04cc Compare April 17, 2026 15:57
Comment thread quadrants/codegen/llvm/codegen_llvm.cpp Outdated

@claude claude bot left a comment


LGTM — all review issues addressed, including the CPU-only arch guard on the 256 KB adstack budget check.

Extended reasoning...

Overview

The PR adds adstack overflow detection to the LLVM autodiff runtime: stack_push now sets runtime->adstack_overflow_flag via a relaxed atomic store on overflow, stack_pop/stack_top_primal have underflow/clamping guards, LlvmRuntimeExecutor::check_adstack_overflow() polls the flag on every synchronize(), and LlvmProgramImpl::pre_finalize() suppresses the poll during teardown. Additionally AdStackAllocaStmt::size_in_bytes() is corrected from sizeof(int32) to sizeof(int64) to match the runtime layout. 11 files touched across codegen, IR, runtime, and tests.

Security risks

No security-sensitive surfaces (no auth, crypto, or permission code). The relaxed-atomic fix for the data race in stack_push is a correctness improvement.

Level of scrutiny

This is production-critical runtime code (autodiff correctness on all LLVM backends), but the changes are well-bounded safety additions rather than logic rewrites. Every bug raised across multiple review rounds was addressed:

  • Comment inaccuracies (error_code vs adstack_overflow_flag) — fixed
  • finalizing_ timing (pre_finalize() before teardown syncs) — fixed
  • sizeof header mismatch — fixed
  • Atomic write for multi-threaded overflow — fixed
  • CPU-only arch guard on the 256 KB budget check — fixed (fixup commit 851d8fd)
  • test_stack heap leak — fixed
  • Test ad_stack_size pins and extension guards — fixed

Other factors

Test coverage is comprehensive: overflow raises, flag reset, teardown safety, multi-threaded overflow, and large-capacity resolution are all exercised. The bug hunting system found no new issues in the final state. The unresolved inline thread on the GPU budget check is moot because the arch guard is present in the submitted diff.

@hughperkins
Collaborator

Opus says this PR is three ~independent fixes. Please could we split into three PRs?

These three fixes are orthogonal and could be PR'd separately?

Yes — all three are independently mergeable, with one soft ordering preference. Sketch:

Fix 3: AdStackAllocaStmt::size_in_bytes header size

• Scope: 1 file, ~4 lines (quadrants/ir/statements.h).
• Dependencies: none. Pre-existing bug — runtime always used u64 for the header, LLVM alloca was 4 B short.
• Test: none new required (existing tests now allocate the correct size, and the extended test_stack in fix 1 would surface it, but the fix stands on its own).
• Risk: near-zero. Just makes the alloca 4 B larger.
• Could ship: tomorrow. Smallest, cleanest, highest-confidence change in the PR.

Fix 1: runtime overflow → Python exception + teardown safety

• Scope: ~8 files (runtime.cpp, internal_functions.h, llvm_runtime_executor.{cpp,h}, llvm_program.h, program.cpp, program_impl.h, codegen_llvm.cpp for the stack_push signature
change + Python tests).
• Dependencies: none semantically, but technically the stack_push(runtime, stack, max, elem_size) signature change touches codegen_llvm.cpp too — any PR landing in parallel
that touches that callsite needs rebasing.
• Self-contained test coverage: yes — all five new Python tests + the extended test_stack in internal_functions.h.
• Risk: moderate. New virtual hook (pre_finalize), new runtime flag, changes the always-on path in synchronize(). The teardown subprocess test is specifically pinning one
worst-case interaction.
• Could ship: as its own PR. It's the biggest piece but internally coherent; the push-bounds-check, pop-clamp, top-of-stack clamp, flag-poll, and pre_finalize teardown hook
are all one story: "overflow should surface as an exception, not as silent corruption, even in the unsynced-exit path."

Fix 2: codegen per-task stack-budget guard

• Scope: 2 files (codegen_llvm.cpp, codegen_llvm.h).
• Dependencies: ideally lands after fix 3 so the 256 KB budget math sums accurate per-adstack sizes (not 4 B-too-small figures). Not a correctness issue at this budget
granularity, but technically cleaner.
• Self-contained test coverage: none in this PR. The PR's five new tests all exercise fix 1 (runtime overflow). Fix 2 is the codegen-time check and should have at least one
test that deliberately crosses 256 KB at ad_stack_size=4096 and asserts the QD_ERROR_IF fires. Splitting this out as its own PR makes the absence of that test more visible
and easier to address.
• Risk: moderate. CPU-only gating with a hard-coded constant; a kernel that works fine today but happens to straddle 256 KB will start failing at compile time. Worth a
dedicated review pass.
• Could ship: as its own PR, ideally after fix 3, and ideally with a test added first.

Suggested order

  1. Fix 3 first (trivial, unblocks the others). One-line PR.
  2. Fix 1 second (big, self-contained, carries the Python tests that are directly visible to users).
  3. Fix 2 third (smallest of the two behavior changes, but wants a test the current PR doesn't include — easier to add that test in a dedicated PR than to stretch this one
    further).

Why they're getting reviewed together anyway

They share a theme — "stop silent wrong gradients in the adstack path" — and they were probably discovered together while chasing the same symptom. That's normal. But for
review, they land in different files with different risk profiles and different test needs, and reviewers can sign off on them independently. If CI time per PR is cheap,
splitting is a straight win for reviewability and bisectability. If the team prefers bigger atomic changes, the current shape is defensible, but in that case the PR description
should at least be explicit that fix 2 is untested in this PR.

it sounds like we need the test from fix 1 for fix 3?

Yes — kind of. Let me be precise about what "needs" means here.

Fix 3 is provably correct by inspection

The runtime reads/writes the header through *(u64 *)stack at every site (stack_init, stack_push, stack_top_primal). The alloca was sized as if the header were int32. Changing it
to int64 makes the alloca size match the layout the runtime has always used. There's no semantic ambiguity to argue over.

But it has no executable validation in isolation

Without fix 1, the only test exercising the adstack at the C++ level is the old test_stack, which does 4 pushes into a max_size=16 stack — nowhere near the corrupted region. The
corruption from the int32 sizing is in the last 4 bytes of the alloca, which only gets touched when pushes approach max_size. So an old-test_stack-only validation regime would
happily pass with fix 3 reverted.

That's why I said "obviously correct by inspection but with no executable proof" — and why fix 1's rewritten test_stack (which pushes to max_size and reads *(u64 *)stack) is the
test that incidentally pins fix 3.

What this means for splitting

Three options, in increasing test discipline:

  1. Ship fix 3 alone, no test. Justified on "matches runtime layout, strictly increases alloca size by 4 bytes, can't make anything worse." Defensible but leaves the bug class
    (alloca-size vs runtime-layout drift) without a regression pin.

  2. Ship fix 3 alone, with a 5-line test. Add a minimal C++ test (or piggyback on existing test_stack) that pushes to max_size and asserts *(u64 *)stack == max_size. That single
    assertion would have failed under the old int32 sizing because the high 4 bytes of the header would have been clobbered by the first slot's writes. Smallest validating
    change.

  3. Ship fix 3 + fix 1 together (current shape, sort of). Fix 1's test_stack rewrite covers fix 3 as a side effect. You don't need to write a separate test, but you also can't
    bisect a fix-3 regression to fix 3 itself.

If splitting, option 2 is the cleanest — a one-liner fix plus a one-line assertion is still a tiny PR, and it gives fix 3 its own regression pin instead of relying on fix 1's
coverage. Then fix 1 lands on top with its more thorough overflow-path tests.

The general principle: a fix that's "obviously right" but covered only by an unrelated test is one accidental refactor away from being silently undone. A 5-line dedicated test is
cheap insurance.
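The one-line header assertion proposed in option 2 can be sketched with Python's struct module, modeling the u64 header the runtime writes. This is an analogy for the C++ check, with the header format as an assumption.

```python
import struct

HEADER_FMT = "<Q"  # the runtime reads/writes the element count as a u64
max_size = 16

# Simulate pushing to capacity: the runtime stores the count in the header.
header = struct.pack(HEADER_FMT, max_size)

# The int64 sizing reserves 8 bytes for this header; the old int32 sizing
# reserved only 4, so the high half would be clobbered by the first slot.
print(len(header))  # 8

(count,) = struct.unpack(HEADER_FMT, header)
print(count)  # 16 — the assertion `*(u64 *)stack == max_size` in C++ terms
```

A dedicated assertion of this shape gives fix 3 its own regression pin, independent of fix 1's broader `test_stack` rewrite.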

@hughperkins
Collaborator

So let's split into the following PRs please:

  • AdStackAllocaStmt::size_in_bytes header size (with small 5-line test)
  • runtime overflow → Python exception + teardown safety
  • codegen per-task stack-budget guard

Comment thread quadrants/runtime/llvm/llvm_runtime_executor.h
@duburcqa duburcqa force-pushed the duburcqa/fix_ad_correctness branch from 8a6f423 to 930d6d9 Compare April 21, 2026 06:40
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 2e738b1 to e4c60f6 Compare April 21, 2026 06:40
@duburcqa duburcqa changed the base branch from duburcqa/fix_ad_correctness to duburcqa/split_llvm_adstack_runtime_overflow April 21, 2026 06:44
@duburcqa duburcqa changed the title [AutoDiff] Autodiff 7: Surface LLVM adstack push/pop overflow as a Python exception [AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget Apr 21, 2026
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from d1a9c46 to 97533d1 Compare April 21, 2026 06:59
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from e4c60f6 to ba3976d Compare April 21, 2026 06:59

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 quadrants/runtime/llvm/runtime_module/runtime.cpp:1107-1112 — The PR deletes all eight shuffle_down runtime helpers (amdgpu_shuffle_down_i32/f32/i64/f64 and cuda_shuffle_down_i32/f32/i64/f64) from runtime.cpp as incidental cleanup, but codegen_cuda.cpp (~lines 746-752) and codegen_amdgpu.cpp (~lines 449-455) still emit JIT calls to these exact symbol names when lowering subgroup shuffle_down operations. Any CUDA or AMDGPU kernel that uses qd.subgroup.shuffle_down() will fail at JIT link time with an undefined symbol error. Restore the eight deleted functions or update the codegen sites to inline the shuffle_down logic.

    Extended reasoning...

    What the bug is and how it manifests

    The PR diff shows two deletion hunks in runtime.cpp: one removing amdgpu_shuffle_down_i32/f32/i64/f64 (around line 1107) and one removing cuda_shuffle_down_i32/f32/i64/f64 (around line 1143). These are described as incidental cleanup but they are not dead code — they are JIT-linked runtime symbols that GPU codegen actively resolves at kernel compilation time.

    The specific code path that triggers it

    codegen_cuda.cpp contains an emit_cuda_shuffle_down() function (approximately lines 744–755) that calls:
    call("cuda_shuffle_down_i32", offset, value)
    call("cuda_shuffle_down_f32", offset, value)
    call("cuda_shuffle_down_f64", offset, value)
    call("cuda_shuffle_down_i64", offset, value)

    codegen_amdgpu.cpp contains an emit_amdgpu_shuffle_down() function (approximately lines 447–458) that calls:
    call("amdgpu_shuffle_down_i32", offset, value)
    call("amdgpu_shuffle_down_f32", offset, value)
    call("amdgpu_shuffle_down_f64", offset, value)
    call("amdgpu_shuffle_down_i64", offset, value)

    Neither codegen file is modified by this PR, so these eight call() sites remain unchanged and still reference the deleted symbols.

    Why existing code does not prevent it

    The call() invocations in LLVM codegen resolve symbol names from the JIT-linked runtime bitcode module at kernel compilation time. There is no compile-time check that the symbol exists in the module — the linker error only surfaces when a CUDA or AMDGPU kernel that exercises a subgroup.shuffle_down() operation is compiled. The PR's own test suite does not include CUDA/AMDGPU subgroup tests (the tests added are all CPU adstack tests), so the breakage escapes CI.

    What the impact would be

    Any user calling qd.subgroup.shuffle_down() on a CUDA or AMDGPU backend will receive a JIT linker error ('undefined symbol: cuda_shuffle_down_i32' or equivalent) the first time the kernel is compiled, turning previously working code into a hard failure. This is a complete regression for GPU subgroup reduction patterns.

    How to fix it

    Either (a) restore the eight deleted functions — they were not related to the adstack work and their deletion was purely cosmetic — or (b) update codegen_cuda.cpp and codegen_amdgpu.cpp to inline the shuffle_down logic directly instead of calling the runtime helpers.

    Step-by-step proof

    1. PR diff, runtime.cpp hunk 1 (line ~1107): deletes amdgpu_shuffle_down_i32, amdgpu_shuffle_down_f32, amdgpu_shuffle_down_i64, amdgpu_shuffle_down_f64.
    2. PR diff, runtime.cpp hunk 2 (line ~1143): deletes cuda_shuffle_down_i32, cuda_shuffle_down_f32, cuda_shuffle_down_i64, cuda_shuffle_down_f64.
    3. grep for 'shuffle_down' in the post-PR runtime.cpp returns zero matches — the functions are gone.
    4. codegen_cuda.cpp (not modified by PR): emit_cuda_shuffle_down() at lines 746/748/750/752 calls call("cuda_shuffle_down_i32", ...), call("cuda_shuffle_down_f32", ...), call("cuda_shuffle_down_f64", ...), call("cuda_shuffle_down_i64", ...).
    5. codegen_amdgpu.cpp (not modified by PR): emit_amdgpu_shuffle_down() at lines 449/451/453/455 calls the amdgpu variants.
    6. When any CUDA/AMDGPU kernel containing a subgroup.shuffle_down is compiled, the JIT linker cannot resolve the missing symbols and raises an undefined symbol error.

@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from 97533d1 to adf1dc1 Compare April 21, 2026 07:19
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from ba3976d to e82a02b Compare April 21, 2026 07:19
Comment thread quadrants/codegen/llvm/codegen_llvm.cpp Outdated
Comment thread quadrants/codegen/llvm/codegen_llvm.h
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from adf1dc1 to baabb85 Compare April 21, 2026 08:18
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from e82a02b to c82e9d9 Compare April 21, 2026 08:18
Comment thread tests/python/test_adstack.py
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from baabb85 to 727e1a4 Compare April 21, 2026 09:51
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from c82e9d9 to 49c40a2 Compare April 21, 2026 09:51

@claude claude bot left a comment


LGTM — focused CPU-only budget guard with correct arch_is_cpu gate, properly reset per task, and a well-constructed child-process test.

Extended reasoning...

Overview

Touches three files: codegen_llvm.{cpp,h} and test_adstack.py. The change adds a single new field (ad_stack_fn_scope_bytes_) and a per-task accumulator reset in init_offloaded_task_function, then enforces a 256 KB budget via QD_ERROR_IF inside visit(AdStackAllocaStmt*). The test runs the budget-crossing kernel in a subprocess and asserts a non-zero exit code plus the expected guard message in output.
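    The accumulate-and-reset pattern this overview describes can be modeled in isolation. The sketch below is illustrative, not the PR's actual code: the class name, method names, and exception-based error path are assumptions, while the 256 KB budget and the per-task reset mirror the PR description.

    ```cpp
    #include <cstddef>
    #include <stdexcept>
    #include <string>

    // Toy model of the per-task adstack budget accounting: the accumulator
    // resets at the start of each offloaded task, every AdStackAlloca adds
    // its byte size, and crossing the budget aborts codegen loudly.
    class AdStackBudget {
      static constexpr std::size_t kBudgetBytes = 256 * 1024;
      std::size_t fn_scope_bytes_ = 0;

     public:
      void begin_task() { fn_scope_bytes_ = 0; }  // per-task reset

      void on_alloca(std::size_t size_in_bytes, const std::string &task_name) {
        fn_scope_bytes_ += size_in_bytes;
        if (fn_scope_bytes_ > kBudgetBytes)
          throw std::runtime_error(
              "LLVM autodiff-stack budget exceeded in task '" + task_name + "'");
      }

      std::size_t used() const { return fn_scope_bytes_; }
    };
    ```

    The per-task reset is the property the review highlights: without it, allocations from unrelated offloaded tasks would accumulate and falsely trip the guard.
    
    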

Security risks

No security-sensitive paths are touched. The change adds a compile-time guard that aborts the process on overflow — the same category of safety-check that already exists elsewhere in the codebase. No auth, crypto, or permissions code is involved.

Level of scrutiny

Low-to-moderate. The diff is small and self-contained. The two most important correctness properties — (a) the arch_is_cpu(current_arch()) gate preventing false rejections of valid GPU kernels, and (b) using current_task->name rather than the kernel-level kernel_name in the error message — are both correctly implemented in the current diff. The field-level comment in the header is already properly scoped with "On CPU arches only".

Other factors

All issues surfaced in the review thread (comment inaccuracies, teardown safety, flag-reset, ad_stack_size header mismatch, GPU guard) were addressed in earlier PRs or earlier commits of this PR. The one inline comment from today that lacks a formal "resolved" marker (codegen_llvm.h field comment) reflects code that is already correct in the current diff. The test is gated on both qd.extension.adstack and qd.extension.data64, and the subprocess approach correctly handles the QD_ERROR_IF → std::terminate termination path.

@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from 727e1a4 to c9751cf Compare April 21, 2026 12:03
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 49c40a2 to 3804091 Compare April 21, 2026 12:03

@claude claude bot left a comment


LGTM — the CPU-only budget guard is correctly implemented and the inline comments I raised today were false positives (the code already used current_task->name and the header comment was already CPU-scoped).

Extended reasoning...

Overview

PR #495 adds a CPU-only per-task adstack budget guard to TaskCodeGenLLVM::visit(AdStackAllocaStmt*). The change touches three files: codegen_llvm.cpp (guard logic + per-task accumulator reset), codegen_llvm.h (new ad_stack_fn_scope_bytes_ field), and tests/python/test_adstack.py (child-process test that verifies the guard fires). This is the focused 'codegen budget guard' slice of the autodiff safety series, matching the split that was requested.

Security risks

No security-sensitive code paths touched. The guard is a compile-time check that produces a loud abort rather than silent memory corruption; it cannot be reached from user-supplied data at runtime, only from kernel compilation. No auth, crypto, or permission logic is involved.

Level of scrutiny

Moderate. The guard is CPU-only (correctly gated on arch_is_cpu(current_arch()) in the .cpp; the header comment already notes the gate and the GPU-local-memory distinction), uses current_task->name for per-task identification in the error message, and resets the accumulator at the start of each offloaded task. Earlier rounds of review caught and fixed a real ordering bug (pre_finalize timing), a header-size mismatch (sizeof(int32) vs sizeof(int64)), a missing arch guard, and several comment inaccuracies — all addressed by the author. No bugs were found by the automated system on the current revision.

Other factors

My two inline comments posted today (2026-04-21 07:35 and 08:48) were false positives: the diff already uses current_task->name and the header already carries the 'On CPU arches only' scoping language. The subprocess.run timeout nit in the budget-guard test is a minor style concern; the teardown test's explicit comment explains the team's rationale for relying on pytest's per-test timeout instead, which applies equally here.

@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from c9751cf to 621e760 Compare April 21, 2026 14:42
@duburcqa duburcqa force-pushed the duburcqa/llvm_adstack_safety branch from 3804091 to f21a94c Compare April 21, 2026 14:42